Four-Dimensional Cellular Automata Acceleration Using GPGPU

Introduction

GPGPU is General-Purpose computation on Graphics Processing Units: using an NVIDIA or ATI graphics card to perform non-graphical calculations for your application. The technique is described in detail in the book GPU Gems 2 and on the www.gpgpu.org web site. That web site links to the helloGPGPU_GLSL sample code, which I used as a starting point.

Modern graphics processors have multiple floating-point pipelines and support programmable operations via shader languages such as the OpenGL Shading Language. As a sample application I have re-implemented a four-dimensional cellular automaton in the OpenGL Shading Language. For an explanation of cellular automata, see here.

Four-Dimensional Cellular Automata

For this demonstration I am using a continuous-valued cellular automaton. Each cell holds three floating-point values (red, green, blue). On each time step a cell takes on a new value based on the values of its eight neighbors in four dimensions, one in each direction along each axis (+/- row, col, lvl, wvl). The OpenGL three-dimensional viewing program allows the user to choose which W-plane to view, so a three-dimensional slice of the four-dimensional space is displayed. The displayed cubes are translucent, with opacity determined by the sum of the (red, green, blue) values; a white cube with values (1.0, 1.0, 1.0) is opaque.
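To illustrate the neighborhood structure, here is a minimal Python sketch of one update step. The names (`DIM`, `OFFSETS`, `step`) and the averaging rule are my own illustrative stand-ins, not the rule used in the actual shader; the real program's update function is in the linked shader code.

```python
# Sketch of one update step for a continuous-valued 4D CA.
# Each cell holds an RGB triple; a cell's new value depends on its
# eight axis neighbors (+/-1 along row, col, lvl, wvl).
# NOTE: the averaging rule below is a placeholder for illustration,
# not the rule from the actual shader program.
DIM = 16
OFFSETS = [(1, 0, 0, 0), (-1, 0, 0, 0), (0, 1, 0, 0), (0, -1, 0, 0),
           (0, 0, 1, 0), (0, 0, -1, 0), (0, 0, 0, 1), (0, 0, 0, -1)]

def step(cells):
    """cells: dict mapping (r, c, l, w) -> (red, green, blue).
    Missing cells are treated as blank (0, 0, 0)."""
    new = {}
    for (r, c, l, w) in cells:
        neighbors = [cells.get((r + dr, c + dc, l + dl, w + dw),
                               (0.0, 0.0, 0.0))
                     for dr, dc, dl, dw in OFFSETS]
        # Combine each channel across the eight neighbors.
        new[(r, c, l, w)] = tuple(sum(ch) / 8.0 for ch in zip(*neighbors))
    return new
```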

OpenGL Shading Language Implementation

The OpenGL Shading Language is similar to C, extended with primitive datatypes such as two- to four-component vectors addressed as rgba or xyzw. GPGPU computations take their input data in the form of a texture. I encoded my four-dimensional 16x16x16x16 space in a 256x256 texture by packing the R and C dimensions into X as (16*R + C) and the L and W dimensions into Y as (16*L + W). Floating-point RGB textures are supported directly, so no encoding of the color data was necessary. The fourth (alpha) channel of the color was filled with 0 or 1 to indicate whether the cell should always be blank; the 0th and 15th cells in each dimension were kept blank to prevent the automaton from wrapping around the edges.
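The 4D-to-2D index mapping can be sketched as follows. The helper names `encode` and `decode` are hypothetical, for illustration only; the mapping itself is the (16*R + C, 16*L + W) packing described above.

```python
# Map a 4D cell coordinate (r, c, l, w), each in 0..15, onto a
# 256x256 texture: x packs row and column, y packs level and W-plane.
DIM = 16

def encode(r, c, l, w):
    """4D coordinate -> 2D texel coordinate (x, y)."""
    return (DIM * r + c, DIM * l + w)

def decode(x, y):
    """2D texel coordinate -> 4D coordinate (r, c, l, w)."""
    return (x // DIM, x % DIM, y // DIM, y % DIM)
```

The mapping is a bijection between the 16x16x16x16 space and the 256x256 texture, so no texels are wasted and decoding is exact.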

To perform the computation, the input texture is applied to a quadrilateral and rendered into the frame buffer. As the texture is applied, the custom cellular automata shader is run on each pixel. The shader code can be seen here, and the entire program is available here. The frame buffer is then copied back over the source texture. If desired, the frame buffer contents can also be copied back out to the application program, where the data is converted back into four-dimensional form for display in three dimensions. Alternatively, the two-dimensional contents of the frame buffer can be viewed directly. The following images are the two-dimensional texture representations of the above images:

Performance

The ultimate goal is, of course, to achieve better performance than the same algorithm can achieve on a standard CPU. I compared the 16x16x16x16 RGB four-dimensional cellular automaton implemented on both the CPU and the GPU, in two configurations: returning the data to the CPU for display in three dimensions, and retaining it on the GPU for direct two-dimensional display as a texture (the CPU 2-D case is not actually displayed). All performance measurements are on a 1.73 GHz AMD Athlon CPU and an ATI Radeon 9800 Pro GPU, running Debian Linux with a 2.4.26 kernel, XFree86 4.3, and ATI 8.12.10-1 drivers from here.

Steps per second:

          1 CA step/frame   10 CA steps/frame   100 CA steps/frame   1000 CA steps/frame
CPU 3D         193                203                  204                   204
GPU 3D          16                144                  485                   637
CPU 2D         195                203                  204                   204
GPU 2D         599                645                  653                   657

(Data revised to remove most display overhead.)

The major limitation on the 3-D GPU performance is the glReadPixels() call that moves the texture data from the GPU back to the CPU. In the 10 and 100 steps/frame cases the cellular automaton is updated (stepped forward) 10 or 100 times before the data is copied back to the CPU. In the 2-D case the data is never copied back from the GPU to the CPU, since it is displayed directly.
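The batching pattern above can be sketched in Python. Here `gpu_step` and `read_back` are hypothetical stand-ins for the shader pass and the glReadPixels() copy, respectively; the point is only the loop structure, where the expensive readback happens once per displayed frame rather than once per CA step.

```python
def run(state, frames, steps_per_frame, gpu_step, read_back):
    """Advance the CA, reading data back only once per displayed frame.

    gpu_step:  one shader pass over the texture (data stays on the GPU).
    read_back: the expensive GPU->CPU copy (glReadPixels() analogue).
    """
    frames_out = []
    for _ in range(frames):
        for _ in range(steps_per_frame):
            state = gpu_step(state)          # cheap: stays on the GPU
        frames_out.append(read_back(state))  # expensive: once per frame
    return frames_out
```

With steps_per_frame at 10 or 100, the readback cost is amortized over that many CA steps, which is why the GPU 3D numbers in the table climb so sharply as steps per frame increase.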

Conclusion

General-purpose algorithms run on the GPU can outperform the CPU if movement of data onto and off of the GPU is minimized. Ideally the algorithm should iterate many times before exchanging data with the CPU. At present, facilities for GPGPU programming are primitive. Higher-level C-like languages such as the OpenGL Shading Language are available, but debugging is difficult, and it is hard to predict which code modifications will enhance or destroy pipeline performance; much trial-and-error is required.

Special Thanks to Paul Gyugyi

Creative Commons License
Four-Dimensional Cellular Automata Images by Eric Rollins is licensed under a Creative Commons Attribution 3.0 Unported License.